AITopics

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Neural Information Processing SystemsFeb-17-2026, 03:59:04 GMT

a3eadeebbc9eecd621086f6978865a85-Paper-Conference.pdf

machine learning, natural language, recognition, (20 more...)

Country: Europe > Netherlands > North Holland > Amsterdam (0.04)

Genre: Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Nguyen, Minh Hoang, Thiet, Su Nguyen

Enhancing OCR for Sino-Vietnamese Language Processing via Fine-tuned PaddleOCRv5

arXiv.org Artificial IntelligenceDec-2-2025

Recognizing and processing Classical Chinese (Han-Nom) texts play a vital role in digitizing Vietnamese historical documents and enabling cross-lingual semantic research. However, existing OCR systems struggle with degraded scans, non-standard glyphs, and handwriting variations common in ancient sources. In this work, we propose a fine-tuning approach for PaddleOCRv5 to improve character recognition on Han-Nom texts. We retrain the text recognition module using a curated subset of ancient Vietnamese Chinese manuscripts, supported by a full training pipeline covering preprocessing, LMDB conversion, evaluation, and visualization. Experimental results show a significant improvement over the base model, with exact accuracy increasing from 37.5 percent to 50.0 percent, particularly under noisy image conditions. Furthermore, we develop an interactive demo that visually compares pre- and post-fine-tuning recognition results, facilitating downstream applications such as Han-Vietnamese semantic alignment, machine translation, and historical linguistics research. The demo is available at https://huggingface.co/spaces/MinhDS/Fine-tuned-PaddleOCRv5

machine learning, natural language, pattern recognition, (18 more...)

2510.04003

Country: Asia > Vietnam (0.30)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.73)

Neural Information Processing SystemsOct-10-2025, 12:01:43 GMT

Octopus: A Multi-modal LLM with Parallel Recognition and Sequential Understanding

A mainstream of Multi-modal Large Language Models (MLLMs) have two essential functions, i.e., visual recognition ( e.g., grounding) and understanding ( e.g.,

query, recognition, recognition result, (16 more...)

Country: Europe > Netherlands > North Holland > Amsterdam (0.04)

Genre: Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

arXiv.org Artificial IntelligenceSep-8-2025

Discrete Prompt Tuning via Recursive Utilization of Black-box Multimodal Large Language Model for Personalized Visual Emotion Recognition

Takahashi, Ryo, Saito, Naoki, Maeda, Keisuke, Ogawa, Takahiro, Haseyama, Miki

Visual Emotion Recognition (VER) is an important research topic due to its wide range of applications, including opinion mining and advertisement design. Extending this capability to recognize emotions at the individual level further broadens its potential applications. Recently, Multimodal Large Language Models (MLLMs) have attracted increasing attention and demonstrated performance comparable to that of conventional VER methods. However, MLLMs are trained on large and diverse datasets containing general opinions, which causes them to favor majority viewpoints and familiar patterns. This tendency limits their performance in a personalized VER, which is crucial for practical and real-world applications, and indicates a key area for improvement. To address this limitation, the proposed method employs discrete prompt tuning inspired by the process of humans' prompt engineering to adapt the VER task to each individual. Our method selects the best natural language representation from the generated prompts and uses it to update the prompt for the realization of accurate personalized VER.

large language model, machine learning, natural language, (16 more...)

2509.0448

Country: Asia > Japan > Hokkaidō (0.15)

Genre: Research Report (1.00)

Industry:

Transportation > Air (0.42)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Emotion (1.00)

arXiv.org Artificial IntelligenceJul-8-2025

Experiment on creating a neural network with weights determined by the potential of a simulated electrostatic field

Polad, Geidarov

This paper explores the possibility of determining the weights and thresholds of a neural network using the potential -- a parameter of an electrostatic field -- without analytical calculations and without applying training algorithms. The work is based on neural network architectures employing metric recognition methods. The electrostatic field is simulated in the Builder C++ environment. In the same environment, a neural network based on metric recognition methods is constructed, with the weights of the first-layer neurons determined by the values of the potentials of the simulated electrostatic field. The effectiveness of the resulting neural network within the simulated system is evaluated using the MNIST test dataset under various initial conditions of the simulated system. The results demonstrated functional viability. The implementation of this approach shows that a neural network can obtain weight values almost instantaneously from the electrostatic field, without the need for analytical computations, lengthy training procedures, or massive training datasets.

artificial intelligence, machine learning, neural network, (18 more...)

doi: 10.3103/S0147688222050161

2507.02933

Country: Asia (0.46)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Neural Information Processing SystemsMay-27-2025, 11:19:39 GMT

Octopus: A Multi-modal LLM with Parallel Recognition and Sequential Understanding

A mainstream of Multi-modal Large Language Models (MLLMs) have two essential functions, i.e., visual recognition (e.g., grounding) and understanding (e.g., visual question answering). Presently, all these MLLMs integrate visual recognition and understanding in a same sequential manner in the LLM head, i.e., generating the response token-by-token for both recognition and understanding. We think unifying them in the same sequential manner is not optimal for two reasons: 1) parallel recognition is more efficient than sequential recognition and is actually prevailing in deep visual recognition, and 2) the recognition results can be integrated to help high-level cognition (while the current manner does not). Such motivated, this paper proposes a novel "parallel recognition sequential understanding" framework for MLLMs. The bottom LLM layers are utilized for parallel recognition and the recognition results are relayed into the top LLM layers for sequential understanding.

llm layer, multi-modal llm, recognition, (5 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

arXiv.org Artificial IntelligenceMay-19-2025

A Multi-modal Fusion Network for Terrain Perception Based on Illumination Aware

Wang, Rui, Yang, Shichun, Chen, Yuyi, Li, Zhuoyang, Tong, Zexiang, Xu, Jianyi, Lu, Jiayi, Feng, Xinjie, Cao, Yaoguang

Road terrains play a crucial role in ensuring the driving safety of autonomous vehicles (AVs). However, existing sensors of AVs, including cameras and Lidars, are susceptible to variations in lighting and weather conditions, making it challenging to achieve real-time perception of road conditions. In this paper, we propose an illumination-aware multi-modal fusion network (IMF), which leverages both exteroceptive and proprioceptive perception and optimizes the fusion process based on illumination features. We introduce an illumination-perception sub-network to accurately estimate illumination features. Moreover, we design a multi-modal fusion network which is able to dynamically adjust weights of different modalities according to illumination features. We enhance the optimization process by pre-training of the illumination-perception sub-network and incorporating illumination loss as one of the training constraints. Extensive experiments demonstrate that the IMF shows a superior performance compared to state-of-the-art methods. The comparison results with single modality perception methods highlight the comprehensive advantages of multi-modal fusion in accurately perceiving road terrains under varying lighting conditions. Our dataset is available at: https://github.com/lindawang2016/IMF.

artificial intelligence, lighting condition, machine learning, (18 more...)

2505.11066

Country: Asia > China (0.47)

Genre: Research Report > New Finding (0.68)

Industry:

Transportation > Ground > Road (1.00)
Automobiles & Trucks (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Data Science (0.94)
(2 more...)

arXiv.org Artificial IntelligenceMar-5-2025

Towards Robust Universal Information Extraction: Benchmark, Evaluation, and Solution

Zhu, Jizhao, Shi, Akang, Li, Zixuan, Bai, Long, Jin, Xiaolong, Guo, Jiafeng, Cheng, Xueqi

In this paper, we aim to enhance the robustness of Universal Information Extraction (UIE) by introducing a new benchmark dataset, a comprehensive evaluation, and a feasible solution. Existing robust benchmark datasets have two key limitations: 1) They generate only a limited range of perturbations for a single Information Extraction (IE) task, which fails to evaluate the robustness of UIE models effectively; 2) They rely on small models or handcrafted rules to generate perturbations, often resulting in unnatural adversarial examples. Considering the powerful generation capabilities of Large Language Models (LLMs), we introduce a new benchmark dataset for Robust UIE, called RUIE-Bench, which utilizes LLMs to generate more diverse and realistic perturbations across different IE tasks. Based on this dataset, we comprehensively evaluate existing UIE models and reveal that both LLM-based models and other models suffer from significant performance drops. To improve robustness and reduce training costs, we propose a data-augmentation solution that dynamically selects hard samples for iterative training based on the model's inference loss. Experimental results show that training with only \textbf{15\%} of the data leads to an average \textbf{7.5\%} relative performance improvement across three IE tasks.

large language model, machine learning, natural language, (18 more...)

2503.03201

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Dominican Republic (0.04)
Asia > China > Liaoning Province > Shenyang (0.04)
(7 more...)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Kuan, Chun-Yi, Lee, Hung-yi

Gender Bias in Instruction-Guided Speech Synthesis Models

arXiv.org Artificial IntelligenceFeb-8-2025

Recent advancements in controllable expressive speech synthesis, especially in text-to-speech (TTS) models, have allowed for the generation of speech with specific styles guided by textual descriptions, known as style prompts. While this development enhances the flexibility and naturalness of synthesized speech, there remains a significant gap in understanding how these models handle vague or abstract style prompts. This study investigates the potential gender bias in how models interpret occupation-related prompts, specifically examining their responses to instructions like "Act like a nurse". We explore whether these models exhibit tendencies to amplify gender stereotypes when interpreting such prompts. Our experimental results reveal the model's tendency to exhibit gender bias for certain occupations. Moreover, models of different sizes show varying degrees of this bias across these occupations.

artificial intelligence, speech synthesis, style prompt, (13 more...)

2502.05649

Country:

North America > United States > New York > New York County > New York City (0.14)
Asia > Taiwan (0.04)
Asia > Japan > Honshū > Tōhoku > Iwate Prefecture > Morioka (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.94)

Industry: Health & Medicine (1.00)

Technology: Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)